In [1]:
%matplotlib inline
import sys
print(sys.version)
import numpy as np
print(np.__version__)
import pandas as pd
print(pd.__version__)
import matplotlib.pyplot as plt
Alright in this video we're going to dive a bit deeper into grouping and grouping like functions. These are certainly more complicated topics but my hope is that it should give you a strong intuition to what you can do with python.
The data set that we're going to work with is a slice of http://stat-computing.org/dataexpo/2009/
In [2]:
pd.read_csv("../data/june_airplane_data.csv", header=False).head()
Out[2]:
In [3]:
pd.read_html("http://stat-computing.org/dataexpo/2009/the-data.html")[1]
Out[3]:
In [4]:
pd.read_html("http://stat-computing.org/dataexpo/2009/the-data.html")[1][1]
Out[4]:
In [5]:
df = pd.read_csv('../data/june_airplane_data.csv', names=pd.read_html("http://stat-computing.org/dataexpo/2009/the-data.html")[1][1])
In [6]:
df.head()
Out[6]:
Unfortunately we can see that our columns are off by one, we don't have the name column however the fix is straight forward, all that we need to do is shift over our names list by one
In [7]:
df = pd.read_csv('../data/june_airplane_data.csv', names=pd.read_html("http://stat-computing.org/dataexpo/2009/the-data.html")[1][1][1:])
In [8]:
df.head()
Out[8]:
In [9]:
df.describe()
Out[9]:
Now we've got a new issue, we've got a problem with our date time that we'd like to merge. There are two ways to do this, I'll show you the easiest one - the other one would involve combining them into one column then parsing the results. Let's do it the easier way with read_csv. We can specify parse dates as a dictionary where we map the columns to a date time list.
In [10]:
df = pd.read_csv('../data/june_airplane_data.csv', parse_dates={'Date':['Year','Month','DayofMonth']}, names=pd.read_html("http://stat-computing.org/dataexpo/2009/the-data.html")[1][1][1:])
That will automatically parse it for us and print it out correctly. We can see that it's a bit slow because it's going to try and read all those columns exactly.
In [11]:
df.head()
Out[11]:
Now one that can be helpful in analysis is binning, we've gotten our dates sorted out, but we might want to take a look at some specific groups of data. That's where binning comes in. Let's give it a try. Now this is different than grouping because we're not working on a continuous scale so we've got to be a bit arbitary.
Now let's start by setting up something to group things by time. We can see we've got a departure time and arrival time column that is going to be useful. Right now they're in Army time format. I think it'd be nice to break them up into 6 hour chunks, from midnight to 6 am, 6 to noon, noon to 1800, then 1800 to midnight
I'll start by creating a range of values, these are basically the edges of my bins.
In [12]:
ranges = [0,600,1200,1800,2400]
labels = ['Early Morning','Morning','Early Afternoon','Evening']
Now we use the cut function to cut them up into groups.
In [13]:
?pd.cut()
In [14]:
df['DepTime2'] = pd.cut(df.DepTime, ranges, labels=labels).astype('category')
Now we can do the same with the arrival time, since it's more or less the same
In [15]:
df['ArrTime2'] = pd.cut(df.ArrTime, ranges, labels=labels).astype('category')
In [16]:
df.ArrTime2
Out[16]:
Now categorical data is different that what we're used to, we can see that it has a hierarchy. There's things that are less than other things and therefore is ordered. This can come in handy in certain types of analysis.
Really what this allows us to do is describe the data in terms of those categories
In [17]:
df[['DepTime2','ArrTime2']].describe()
Out[17]:
Let's run a group by to see answer a specific question. When flights leave and arrive, what is the average delay time? Now these are for actual departure and arrival times, not the scheduled ones. That's a different time entirely.
IE. If a flight leaves in the early morning and arrives in the early morning, what is the typical delay? Is there a longer departure delay or arrival delay?
We can construct this with a groupby, where we group our new categories together, then select the relevant columns and get the average.
In [18]:
df.groupby(['DepTime2', 'ArrTime2'])[['DepDelay','ArrDelay']].mean()
Out[18]:
Now when I first say this data, I asked myself why am I getting those NaN values. And This is due to the fact that I have a sample of the data, not all of it. In this dataset, I don't have a any flights that take off in the early morning and land in the Early Afternoon.
We can run a sanity check as follows
In [19]:
sum((df.DepTime < 600) & (df.ArrTime > 1200) & (df.ArrTime < 1800))
Out[19]:
In [20]:
df.groupby(['DepTime2', 'ArrTime2']).agg({'ArrDelay':[np.mean, np.std], 'DepDelay':[np.mean, np.std]})
Out[20]:
Now that we've explored binning, let's try and answer another question, which flights are worst for delays. How can we rank the flights by those that have, on average, the worst delay time. Which, flight, in each departure time group has the worse delay time.
So let's go back to the data.
In [21]:
df.head()
Out[21]:
first we're going to need a way of ranking the flights by the average delay time. Basically we'll take in a dataframe of all the flights in a set and return a dataframe that has a ranked column.
First We'll need a total delay column then we'll have to sort it by the total delay time.
I'm also going to create a new column which is the scheduled departure time so we can see when planes are supposed to go out and group it by those.
In [22]:
df['TotalDelay'] = df.ArrDelay + df.DepDelay
df['ScheduledDepTime'] = pd.cut(df.CRSDepTime, ranges, labels=labels)
In [23]:
def average_delay(dataframe):
dataframe['AvgFlightDelay'] = dataframe.TotalDelay.mean()
return dataframe
In [24]:
avg_delayed = df.groupby('FlightNum').apply(average_delay)
worst ranking...
In [25]:
def ranking(dataframe, column):
dataframe.sort(column, ascending=False,inplace=True)
dataframe[column + "Rank"] = np.arange(len(dataframe)) + 1
return dataframe
In [26]:
avg_delayed_ranked = avg_delayed.groupby('ScheduledDepTime').apply(lambda x: ranking(x, 'AvgFlightDelay'))
In [27]:
avg_delayed_ranked[avg_delayed_ranked.AvgFlightDelayRank == 1]
Out[27]:
We can't just do it on our entire data frame right now because we've got a mix of a lot of different flights, we've got to group it by each flight then apply the ranker
Now we're going to first get the average flight delay for each flight
In [28]:
zscore = lambda x: (x - x.mean()) / x.std()
In [29]:
df.groupby('DayOfWeek').transform(zscore)
Out[29]: